
The Observability Myth

Finally, sleep through the night thanks to observability!

Feb 12, 2024

A new term is making its way through the grapevine via conferences, Slack channels and Microsoft Teams: observability. Just like DevOps, observability has the potential to turn traditional role definitions on their head and make a significant impact on IT. Observability brings transparency to application landscapes and, among other things, shifts responsibility for application monitoring towards application developers. Ideally, all members of the development team work together with operations towards the common goal of observability.


The term observability originally comes from control theory, which goes back to Rudolf E. Kálmán [1]. As early as 1960, he defined observability as a property of mathematical systems that describes how well the internal state of a system can be deduced from its outputs. Applied to modern software landscapes, this means that we want to be able to:

  • Understand the internal state of an application,
  • Understand how an application has maneuvered itself into its current state,
  • Achieve this for all applications and infrastructure components.

All of this exclusively with the help of external tools.

These principles give rise to direct challenges for our applications, but also for our development team and the organization:

  • How do applications and existing infrastructure components provide data?
  • How do I collect this data and make it available for further analysis?
  • Who in the development team and/or the organization benefits from which data?
  • Does everyone in the organization have access to the data they need?

The bar for achieving a state of complete observability is extremely high. Not only the company’s own applications, but also all infrastructure components must continuously comply with the principles outlined above. As applications and infrastructure are constantly evolving, the goal of observability is also a moving target. This is why, like DevOps, observability should be understood as an ambitious, almost philosophical mindset that influences all areas of software development and infrastructure. Observability is more of a path than a goal.

But we have monitoring, don’t we?

Traditional monitoring often breaks down into silos such as IT infrastructure operations and application development – everyone monitors the systems they know and support. No one monitors an application end-to-end, and in the event of a failure, people shrug their shoulders and point to the other silo. Data from the individual silos is never linked in a way that allows sustainable troubleshooting.

When we talk about monitoring systems, our goal is to: 

  • Monitor a wide variety of systems and applications – Spring Boot applications, web servers, switches and auto-scaling groups.
  • Store raw data rather than aggregations; create aggregations only when necessary and drill down to the raw data in the event of an error.
  • Merge and correlate data from different sources.
  • Make this data available to anyone who needs it.

Looking at these requirements, we can see that the problems of traditional monitoring are not purely organizational. We have to realize that existing monitoring systems rarely meet these requirements – so it is time to explore new solutions that can.

Charity Majors, the “Queen of Observability” [2], insists that observability must enable us to “ask the unknown unknowns”: we should be able to ask (as yet) unknown questions about unknown problems on (as yet) unknown data. The online marketplace Etsy came up with an answer to this paradox back in 2011 with its “measure anything, measure everything” approach [3]. Revolutionary and groundbreaking at the time, that approach predates the move from simple software architectures to microservices in multi-cloud Kubernetes environments – a trend that is causing the complexity of our applications to explode. Existing monitoring tools, however, are usually built for foreseeable problems and not for “unknown unknowns”.

An essential component of observability is the storage of raw analysis data instead of aggregates. Traditional monitoring aggregates key figures into metrics or access times into latency histograms – the number of failed logins is a typical metric. With observability, on the other hand, I want to be able to understand the reason for every individual failed login of every user. In fact, we want to be able to break each metric down into its raw data, analyze it in its respective context (user, request, session) and aggregate it again.
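
To make the difference concrete, here is a minimal sketch (assuming Micrometer and SLF4J 2.x on the classpath; class, metric and field names are purely illustrative) that records both an aggregated counter and a raw, context-rich event per failed login:

    import io.micrometer.core.instrument.Counter;
    import io.micrometer.core.instrument.MeterRegistry;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class LoginAuditor {

        private static final Logger log = LoggerFactory.getLogger(LoginAuditor.class);
        private final Counter failedLogins;

        public LoginAuditor(MeterRegistry registry) {
            // Aggregate view: a single counter – cheap to store, but all context is lost.
            this.failedLogins = Counter.builder("logins.failed").register(registry);
        }

        public void onFailedLogin(String userId, String requestId, String reason) {
            failedLogins.increment();
            // Raw view: one structured event per failure, carrying its context so it
            // can later be re-aggregated or drilled into per user, request or session.
            log.atWarn()
               .addKeyValue("user", userId)
               .addKeyValue("requestId", requestId)
               .addKeyValue("reason", reason)
               .log("login failed");
        }
    }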

Taken together, this means that we need to completely rethink the topic of monitoring in order to move towards observability.

Cut to the chase

But what does observability actually mean? Are there standards and tools? Where do I start? If you look at the tools and techniques currently available, observability can be based on the following three pillars:

  • Log management is essential in distributed environments and describes how all log output from my applications is collected and stored centrally in a searchable format. As much metadata as possible is written to each log line (host, cloud, operating system, application version). In addition, filter criteria are extracted from log lines in order to form aggregations (HTTP status codes, log level). Examples include Loki, Graylog and the ELK stack.
  • Metrics are application-internal counters or histograms that can be read out via an interface. Systems such as Prometheus pull metrics from the applications, whereas systems such as Graphite have metrics actively pushed to them. Metrics are also provided with metadata so that they can be aggregated or correlated later.
  • Tracing maps call hierarchies within and between applications. These can be recorded within a JVM, e.g. by Java agents, or via proxy services in a service mesh. Ideally, both data sources can be combined into a comprehensive trace. Examples include Elastic’s APM agent and OpenTelemetry-based agents (a minimal manual-instrumentation sketch follows this list). Traces of JavaScript applications are collected and transmitted in the browser (real user monitoring). A special form of tracing is the exclusive recording of error traces; this reduces the amount of data recorded, but also the possibility of correlation. Sentry is one such tool.
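
To illustrate the tracing pillar beyond agent-based auto-instrumentation, here is a minimal sketch of manual instrumentation with the OpenTelemetry API (assuming opentelemetry-api on the classpath and a configured SDK or agent; the tracer, span and attribute names are illustrative):

    import io.opentelemetry.api.GlobalOpenTelemetry;
    import io.opentelemetry.api.trace.Span;
    import io.opentelemetry.api.trace.Tracer;
    import io.opentelemetry.context.Scope;

    public class CheckoutService {

        // The tracer is provided by whatever SDK or agent has been configured.
        private final Tracer tracer = GlobalOpenTelemetry.getTracer("shop");

        public void checkout(String userId) {
            // One span per logical operation; attributes become searchable metadata.
            Span span = tracer.spanBuilder("checkout").startSpan();
            try (Scope ignored = span.makeCurrent()) {
                span.setAttribute("user.id", userId);
                // ... business logic; instrumented downstream calls join this trace ...
            } finally {
                span.end();
            }
        }
    }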

The problem with these three pillars is that, to be more precise, they are three silos of data that is difficult or impossible to correlate. As soon as I detect an anomaly in my metrics, I can use other metrics for correlation. The jump to the corresponding log messages or traces, however, is hampered by the change of silo. Although I can narrow down the problematic time window in each tool using timestamps, the data sets are unfortunately not really compatible. If the metadata in the silos doesn’t match (in name or content), I start searching from scratch in each silo. Unfortunately, there is hardly any technical solution that breaks down these silos and provides a holistic view of my analysis data. We can still expect a lot of technological development in this area; in fact, commercial providers are currently making inroads here (box: “Commercial observability SaaS solutions”).

Commercial observability SaaS solutions

By using commercial observability SaaS solutions, you can gain an understanding of your application incredibly quickly. In the Java environment, a Java agent is usually started with the JVM, which instruments method calls. The agents have knowledge of the common IoC frameworks (Spring, Quarkus) and can therefore break an application down very well into web requests, middleware and database access, for example. Custom instrumentation and annotations with metadata (current user, etc.) are also possible. If you don’t have an observability solution in place beforehand, you quickly get the feeling that you have just launched the USS Enterprise. However, these solutions reach their limits as soon as you need custom solutions or additional dimensions in your analysis data. Then it is usually time to switch to a self-hosted open source solution. Most providers offer a free trial period during which you can put the solution through its paces in your own infrastructure. Recommended SaaS providers are Honeycomb [4], New Relic [5], DataDog [6] and Elastic APM [7].
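
As an illustration of such custom instrumentation – shown here with the vendor-neutral OpenTelemetry annotations rather than any particular vendor’s API (assuming the opentelemetry-instrumentation-annotations dependency and a configured agent; class, method and attribute names are illustrative) – a method can be wrapped in a span and enriched with metadata like this:

    import io.opentelemetry.api.trace.Span;
    import io.opentelemetry.instrumentation.annotations.WithSpan;

    public class InvoiceService {

        // The agent wraps this method in its own span automatically.
        @WithSpan("render-invoice")
        public byte[] renderInvoice(String userId, long invoiceId) {
            // Enrich the active span with business metadata for later correlation.
            Span current = Span.current();
            current.setAttribute("user.id", userId);
            current.setAttribute("invoice.id", invoiceId);
            // ... rendering logic would go here (hypothetical) ...
            return new byte[0];
        }
    }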

In view of the shortcomings of existing solutions, however, we don’t want to bury our heads in the sand (box: “Open source observability solutions”), because even a small observability solution is better than none at all! After all, our aim is to gain confidence in operating our application and, in the event of problems, to get to the root cause quickly and reliably. Operating applications is a journey on which we gain new insights with every incident. We shine a light on new problem areas and collect additional related data in the form of metrics, traces and logs, which help us analyze the next problem.

Open source observability solutions

Open source solutions offer the greatest flexibility for customization. The following tools are worth a look:

Log management

  • Elastic Stack (Elasticsearch and Kibana) 
  • Loki as a new approach, based on Prometheus
  • Graylog as an enterprise alternative to the ELK stack

Metrics

  • Prometheus with its various exporters
  • Graphite (only for historical reasons)

Tracing

  • Jaeger, Grafana Tempo for request tracing
  • Sentry for error tracing

Observability patterns and anti-patterns

We want to be able to carry out every error analysis without an additional deployment. Relying solely on attaching an agent or debugger at the time of an error, or collecting metrics only for suspected problem cases, is therefore a clear anti-pattern. This matters because some errors never reappear in identical form, and the application’s behavior leading up to the error is often critical for the analysis. The problem may also lie entirely outside our application – for example in the hypervisor of the virtual machine, which was migrated from NVMe to “spinning rust” while under high IOPS load. It is therefore important to collect all metrics consistently and at the finest possible granularity – ideally in raw form. Let’s stick with this example and compare the debugger-based and observability-based approaches:

  • With the debugger and basic metrics: We detect a peak in the response times of our application. We filter on the affected endpoint and analyze the average response times as well as the 90th and 99th percentile. We recognize outliers in the 99th percentile and look for a recorded request (trace). We take its call parameters and recreate the request in the debugger.
  • With continuous observability: We recognize a peak in the response times of our application. We filter on the affected endpoint and analyze the average response times as well as the 90th and 99th percentile. We break down the requests (traces) by target host and recognize that only requests to a database replica are slow. It smells like a hardware problem. We confirm this by looking at the host’s hardware metrics and seeing that it is running at 100 percent IOPS load. A look at the hypervisor metrics shows that this replica has been migrated to a slower datastore.

With observability, we got to the root of the problem much faster. In this case, we would have wasted hours using the debugger. In the worst case, we would have found other supposed problems and never gotten to the actual one at all. Supposed fixes would have increased the complexity of the code, and we would have accumulated more technical debt. Observability helps us maintain a clear overview.

When setting up an observability infrastructure, the following patterns have emerged as useful for clearly separating the application from the solution used:

  • Applications provide metrics about internal states via an API. The application is not responsible for writing the metrics to a database; a third-party application collects the metrics via this interface and writes them to the database. Prometheus is a very good example of this type of pull metrics: the Prometheus data format is simple, and the application has no dependency on Prometheus. In Spring Boot, this can be implemented via the Actuator framework (see the sketch after this list).
  • If third-party applications or infrastructure components do not provide their metrics via an API, these are adapted via exporters. The exporter “translates” the proprietary format of the application into Prometheus metrics, for example. This also allows metrics from cloud providers (e.g. AWS, DigitalOcean) to be pulled into the Prometheus database.
  • Tracing of function calls as well as request tracing between components is recorded transparently for the application. To record internal function calls, an agent can be started with the JVM that instruments the application code. Transparent reverse proxies that instrument the calls between components can be used to record requests between components. In Kubernetes environments, this can be implemented using sidecar containers.
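
As a sketch of the first pattern (pull metrics via the Actuator framework; spring-boot-starter-actuator and micrometer-registry-prometheus are assumed, and all class and metric names are illustrative), the application only exposes its internal state and leaves collection and storage to Prometheus:

    import io.micrometer.core.instrument.Gauge;
    import io.micrometer.core.instrument.MeterRegistry;
    import org.springframework.stereotype.Component;

    import java.util.concurrent.atomic.AtomicInteger;

    @Component
    public class QueueMetrics {

        private final AtomicInteger queueDepth = new AtomicInteger();

        public QueueMetrics(MeterRegistry registry) {
            // The application only exposes its internal state; Prometheus scrapes
            // /actuator/prometheus and stores the samples in its own database.
            Gauge.builder("jobs.queue.depth", queueDepth, AtomicInteger::get)
                 .description("Number of jobs waiting to be processed")
                 .register(registry);
        }

        public void jobEnqueued() { queueDepth.incrementAndGet(); }
        public void jobFinished() { queueDepth.decrementAndGet(); }
    }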

Another anti-pattern is the lack of a single source of truth within a silo of metrics, logs and traces. Cloud providers sometimes provide their own metrics or logging solutions (e.g. CloudWatch). Metrics from the cloud provider’s infrastructure end up there. However, this data cannot then be correlated (without great effort) with data in other data pools, e.g. those of a commercial provider or those of Prometheus. We must therefore ensure that there is only one database, a single source of truth, at least within a silo.

This single source of truth must then also be accessible to everyone who needs it to analyze a problem. In the example above, we have seen that this can also be the case across teams or departments. Access should also not be limited to technical personnel. It is extremely important to be completely transparent with product owners as well.

Observability-driven development

In [8], Charity Majors et al. advocate a “shift left” for observability and define observability-driven development. Shift left means moving parts of the development process, as well as knowledge, forward in time – i.e. to the left on a Kanban board. This is based on a simple but important idea: “It is never as easy to debug a problem as immediately after the code has been written and deployed.” This is why observability should already be taken into account during the development of applications:

  • Testing the instrumentation of APM agents before going live: APM agents instrument common frameworks (Spring, Quarkus) automatically. However, it often makes sense to instrument your own code manually. This instrumentation must be tested.
  • Exporting critical metrics, including business metrics: Standard metrics such as request histograms or JVM heap metrics are automatically exported by the Actuator framework, for example. However, metrics per feature and overarching business metrics are really interesting. Ideally, these metrics can be used to directly answer the question of whether you are currently earning money.
  • Feature flags for every pull request: We review every pull request with the question: “How do I recognize that this feature is breaking?” Problems with a new feature can be verified or falsified with feature flags.
  • We deploy feature by feature accordingly: we mark each deployment in our analysis data so that we can draw direct conclusions about the software version used (a sketch combining business metrics and deployment marking follows this list).
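
As a sketch of these last points (Micrometer is assumed; the metric, tag and class names are illustrative), a feature-flagged business counter can be tagged with the deployed version so every data point can be attributed to a specific deployment:

    import io.micrometer.core.instrument.Counter;
    import io.micrometer.core.instrument.MeterRegistry;

    public class CheckoutMetrics {

        private final Counter completedOrders;

        public CheckoutMetrics(MeterRegistry registry, String appVersion, boolean newCheckoutEnabled) {
            // Every meter registered afterwards carries the deployed version as a tag,
            // so analysis data can be attributed to a specific deployment.
            registry.config().commonTags("version", appVersion);
            // Business metric: is the (feature-flagged) new checkout actually earning money?
            this.completedOrders = Counter.builder("checkout.orders.completed")
                    .tag("variant", newCheckoutEnabled ? "new" : "legacy")
                    .register(registry);
        }

        public void orderCompleted() {
            completedOrders.increment();
        }
    }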

Systems that are easy to understand and whose features can be switched on or off as required consume considerably less time during troubleshooting. With these simple patterns and consistent observability, the mean time to recovery can be significantly reduced. We avoid the almost endless cycle of writing debugging code and redeploying. The application code remains clean, and we don’t get into that time-consuming downward spiral that I like to call a “witch hunt”.

Should I set up an observability team now?

The answer to this question is a clear yes and no. Observability is part of application development: metrics must be exported during development, and the operations aspect must be considered from the outset (including readiness and health checks). APM agents must be instrumented correctly and with the necessary level of detail, and the instrumentation must be tested. To correlate metrics, traces and log outputs, all of them must be enriched with metadata. This enrichment can only partially be done outside the application (e.g. details about the host or cloud provider). The comprehensive production readiness checklists from Zalando [10] and Google [11] are recommended for this topic.
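
As a sketch of the readiness and health checks mentioned above (using Spring Boot Actuator; spring-boot-starter-actuator is assumed, and the client interface and class names are hypothetical), a custom health indicator reports the state of a downstream dependency:

    import org.springframework.boot.actuate.health.Health;
    import org.springframework.boot.actuate.health.HealthIndicator;
    import org.springframework.stereotype.Component;

    @Component
    public class PaymentGatewayHealthIndicator implements HealthIndicator {

        // Hypothetical client interface for the downstream dependency.
        public interface PaymentGatewayClient {
            boolean ping();
        }

        private final PaymentGatewayClient client;

        public PaymentGatewayHealthIndicator(PaymentGatewayClient client) {
            this.client = client;
        }

        @Override
        public Health health() {
            // Reports the dependency's state so readiness probes and dashboards can see it.
            if (client.ping()) {
                return Health.up().withDetail("gateway", "reachable").build();
            }
            return Health.down().withDetail("gateway", "unreachable").build();
        }
    }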

On the other hand, it does not make much sense to have each team build and maintain the infrastructure of its own observability stack. If there is a platform team in the organization, it makes sense to set up a dedicated observability team. On the one hand, this team can operate observability infrastructure, but on the other hand, it can also help application developers to increase their level of observability. Although teams act autonomously according to the textbook, separate observability solutions should be avoided at all costs, as this creates new data silos: the analysis data cannot be correlated with that of other teams.

It should definitely not be the responsibility of an observability team to instrument the code of other teams or extract metrics from it. This must remain within the technical context of the development team.

The transparency about application and infrastructure internals that comes with observability is a great boon for troubleshooting. However, this transparency also has an impact on the organization itself:

  • Not everyone is enthusiastic about this level of transparency, as it also provides points of attack. A culture of trust and mutual respect is important.
  • How do we deal with incidents, and to whom are alerts forwarded? Who has to get up at 3 a.m. to resolve a fault? Faults that would otherwise land with server operations can now – with the right information attached – be routed directly to the responsible development team.
  • How are such new requirements regulated under labor law? Where regulations for 24/7 operation have always been in force in server operations, this suddenly also affects application development.

Every organization is different, but the bottom line is that transparency and observability will also be extremely helpful in breaking down walls in these areas and significantly reducing the mean time to recovery, for example. We waste less time moving problems between teams or debugging applications. We make decisions based on facts and not on assumptions!

In the other articles in the focus section (box: “In-depth observability articles in this issue”), we go into the underlying idea of observability and show how it differs from traditional monitoring approaches. Above all, however, we offer very practical tips on how this approach can be rolled out in your applications.

 

Links & Literature

[1] Kálmán, Rudolf E.: “On the general theory of control systems”: https://www.sciencedirect.com/science/article/pii/S1474667017700948

[2] Charity Majors’ blog: https://charity.wtf/category/observability/

[3] Etsy’s Code as Craft blog: https://codeascraft.com/2011/02/15/measure-anything-measure-everything/

[4] https://www.honeycomb.io

[5] https://newrelic.com

[6] https://www.datadoghq.com

[7] https://www.elastic.co/de/observability/application-performance-monitoring

[8] Majors, Charity; Fong-Jones, Liz; Miranda, George: “Observability Engineering”; O’Reilly, 2022

[9] Forsgren, Nicole; Humble, Jez; Kim, Gene: “Accelerate”; IT Revolution Press, 2018; https://itrevolution.com/measure-software-delivery-performance-four-key-metrics/

[10] Jacobs, Henning: “[Zalando’s] Production Checklist for Webapps on Kubernetes”: https://srcco.de/posts/web-service-on-kubernetes-production-checklist-2019.html

[11] Dogan, Jaana: “Google Cloud Production Guideline”: https://medium.com/google-cloud/production-guideline-9d5d10c8f1e
